We have already provided some rules to follow as we created plots for our examples. Here we aim to provide some general principles we can use as a guide for effective data visualization. Much of this section is based on a talk by Karl Broman titled “Creating effective figures and tables” including some of the figures which were made with code that Karl makes available on his GitHub repository, and class notes from Peter Aldhous’ Introduction to Data Visualization course.

Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles.We compare and contrast plots that follow these principles to those that don’t.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing a distribution for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables.

As a final note, we also note that for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different than a chart intended to communicate a finding to a general audience.

We will be using these libraries:

library(tidyverse)
library(gridExtra)
library(dslabs)
ds_theme_set()

Encoding data using visual cues

We start by describing some principles for encoding data. There are several approaches at our disposal including position, aligned lengths, angles, area, brightness, and color hue.

To illustrate how some of these strategies compare let’s suppose we want to report the results from two hypothetical polls regarding browser preference taken in 2000 and then 2015. Here, for each year, we are simply comparing five quantities, five percentages.

A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:

Here we are representing quantities with both areas and angles since both the angle and area of each pie slice is proportional to the quantity it represents. This turns out to be a suboptimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when only area is available.

The donut chart is an example of a plot that uses only area:

To see how hard it is to quantify angles, note that the rankings and all the percentages in the plots above changed from 2000 to 2015. Can you determine the actual percentages and rank the browsers’ popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot.

In fact, the pie R function help file states:

“Note: Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”

In this case, simply showing the numbers is not only clearer, but it would save on print cost if making a paper version.

Browser 2000 2015
Opera 3 2
Safari 21 22
Firefox 23 21
Chrome 26 29
IE 28 27

The preferred way to plot quantities is to use length and position since humans are much better at judging linear measure. The bar plot uses bars of length proportional to the quantities of interest. By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we ease the quantifying through the position of the top of the bars.

p2 <-browsers %>%
  ggplot(aes(Browser, Percentage)) + 
  geom_bar(stat = "identity", width=0.5, fill=4, col = 1) +
  ylab("Percent using the Browser") +
  facet_grid(.~Year)
grid.arrange(p1, p2, nrow = 2)

Notice how much easier it is to see the differences in the barplot. In fact, we can now determine the actual percentages by following a horizontal line to the x-axis.

If for some reason you need to make a pie chart, do include the percentages as numbers to avoid having to infer them from the angles or area:

In general, position and length are the preferred ways to display quantities over angles which are preferred to area.

Brightness and color are even harder to quantify than angles and area but, as we will see later, they are sometimes useful when more than two dimensions are being displayed.

Know when to include 0

When using barplots it is dishonest not to start the bars at 0. This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed. By avoiding 0, relatively small differences can be made to look much bigger than they actually are. This approach is often used by politicians or media organizations trying to exaggerate a difference.

Here is an illustrative example:

(Source: Fox News) via Media Matters.

From the plot above, it appears that apprehensions have almost tripled when in fact they have only increased by about 16%. Starting the graph at 0 illustrates this clearly:

Here is another example, described in detail here, which makes a 4.6% increase look like a five fold change.

Here is the appropriate plot:

When using position rather than length, it is not necessary to include 0. This is particularly the case when we want to compare differences between groups relative to the variability seen within the groups.

Here is an illustrative example showing country average life expectancy stratified by continent in 2012:

The space between 0 and 43 in the plot on the left adds no information and makes it harder to appreciate the between and within variability. Here, 0 should not be included.

Do not distort quantities

During President Barack Obama’s 2011 State of the Union Address the following chart was used to compare the US GDP to the GDP of four competing nations:

(Source: The 2011 State of the Union Address)

Note judging by the area of the circles the US appears to have an economy over five times larger than China and over 30 times larger than France. However, when looking at the actual numbers one sees that this is not the case. The actual ratios are 2.6 and 5.8 times bigger than China and France respectively. The reason for this distortion is that the radius, rather than the area, was made to be proportional to the quantity which implies that the proportion between the areas is squared: 2.6 turns into 6.5 and 5.8 turns into 34.1. Proportional to the radius compared to proportional to area:

Not surprisingly, ggplot defaults to using area rather than radius. Of course, in this case, we really should not be using area at all since we can use position and length:

Order by a meaningful value

When one of the axes is used to show categories, as is done in barplots, the default ggplot behavior is to order the categories alphabetically when they are defined by character strings. If they are defined by factors, they are ordered by the factor levels. We rarely want to use alphabetical order. Instead we should order by a meaningful quantity.

In all the cases above, the barplots where ordered by the values being displayed. The exception was the graph showing barplots comparing browsers. In this case we kept the order the same across the barplots to ease the comparison. We ordered by the average value of 2000 and 2015. We previously learned how to use the reorder function, which helps achieve this goal.

To appreciate how the right order can help convey a message, suppose we want to create a plot to compare the murder rate across states. We are particularly interested in the most dangerous and safest states. Note the difference when we order alphabetically (the default) versus when we order by the actual rate:

data(murders)
p1 <- murders %>% mutate(murder_rate = total / population * 100000) %>%
  ggplot(aes(state, murder_rate)) +
  geom_bar(stat="identity") +
  coord_flip() +
  xlab("")

p2 <- murders %>% mutate(murder_rate = total / population * 100000) %>%
  mutate(state = reorder(state, murder_rate)) %>%
  ggplot(aes(state, murder_rate)) +
  geom_bar(stat="identity") +
  coord_flip() +
  xlab("")
grid.arrange(p1, p2, ncol = 2)

Note that the reorder function lets us reorder groups as well.

Below is an example we saw earlier with and without reorder. The first orders the regions alphabetically while the second orders them by the group’s median.

Show the data

We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups.

To motivate our first principle, we’ll use our heights data. A commonly seen plot used for comparisons between groups, popularized by software such as Microsoft Excel, shows the average and standard errors (standard errors are defined in a later lecture, but don’t confuse them with the standard deviation of the data).

The plot looks like this:

The average of each group is represented by the top of each bar and the antennae expand to the average plus two standard errors. If all someone receives is this plot they will have little information on what to expect if they meet a group of human males and females. The bars go to 0, does this mean there are tiny humans measuring less than one foot? Are all males taller than the tallest females? Is there a range of heights? Someone can’t answer these questions since we have provided almost no information on the height distribution.

This brings us to our first principle: show the data. This simple ggplot code already generates a more informative plot than the barplot by simply showing all the data points:

heights %>% ggplot(aes(sex, height)) + geom_point() 

For example, we get an idea of the range of the data. However, this plot has limitations as well since we can’t really see all the 238 and 812 points plotted for females and males respectively, and many points are plotted on top of each other. As we have described, visualizing the distribution is much more informative. But before doing this, we point out two ways we can improve a plot showing all the points.

The first is to add jitter: adding a small random shift to each point. In this case, adding horizontal jitter does not alter the interpretation, since the height of the points do not change, but we minimize the number of points that fall on top of each other and therefore get a better sense of how the data is distributed.

A second improvement comes from using alpha blending: making the points somewhat transparent. The more points fall on top of each other, the darker the plot which also helps us get a sense of how the points are distributed.

Here is the same plot with jitter and alpha blending:

heights %>% ggplot(aes(sex, height)) + 
  geom_jitter(width = 0.1, alpha = 0.2) 

Now we start getting a sense that, on average, males are taller than females. We also note dark horizontal lines demonstrating that many reported values are rounded to the nearest integer. Since there are so many points it is more effective to show distributions, rather than show individual points. In our next example we show the improvements provided by distributions and suggest further principles.

Ease comparisons: Use common axes

Earlier we saw this plot used to compare male and female heights:

Since there are so many points it is more effective to show distributions, rather than show individual points. We therefore show histograms for each group:

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

However, from this plot it is not immediately obvious that males are, on average, taller than females. We have to look carefully to notice that the x-axis has a higher range of values in the male histogram. An important principle here is to keep the axes the same when comparing data across to plots.

Note how the comparison becomes easier:

Align plots vertically to see horizontal changes and horizontally to see vertical changes. In these histograms, the visual cue related to decreases or increases in height are shifts to the left or right respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axis are fixed:

This plot makes it much easier to notice that men are, on average, taller. If instead of histograms we want the more compact summary provided by boxplots, then we align them horizontally, since, by default, boxplots move up and down with changes in height.

Following our show the data principle we overlay all the data points:

Now contrast and compare these three plots, based on exactly the same data:

Note how much more we learn from the two plots on the right. Barplots are useful for showing one number, but not very useful when wanting to describe distributions.

Consider transformations

We have motivated the use the log transformation in cases where the changes are multiplicative. Population size was an example in which we found a log transformation to yield a more informative transformation. The combination of incorrectly using barplots when a log transformation is merited can be particularly distorting.

As an example, consider this barplot showing the average population sizes for each continent in 2015:

From this plot one would conclude that countries in Asia are much more populous than other continents. Following the show the data principle we quickly notice that this is due to two very large countries, which we assume are India and China:

Here, using a log transformation provides a much more informative plot. We compare the original barplot to a boxplot using the log scale transformation for the y-axis:

Note in particular that with the new plot we realize that countries in Africa actually have a larger median population size than those in Asia.

Other transformations you should consider are the logistic transformation, useful to better see fold changes in odds, and the square root transformation, useful for count data.

Adjacent visual cues

When comparing income data between 1970 and 2010 across region we made a figure similar to the one below. A difference is that here we look at continents instead of regions, but this is not relevant to the point we are making.

Note that, for each continent, we want to compare the distributions from 1970 to 2010. The default in ggplot is to order alphabetically so the labels with 1970 come before the labels with 2010, making the comparisons challenging.

Comparison is even easier when boxplots are next to each other:

Ease comparison: use color

Comparison is easier when color is used to denote the two things compared.

Think of the color blind

About 10% of the population is color blind. Unfortunately, the default colors used in ggplot are not optimal for this group. However, ggplot does make it easy to change the color palette used in plots. Here is an example of how we can use a color blind friendly pallet. There are several resources that help you select colors, for example this one.

color_blind_friendly_cols <- c("#999999", "#E69F00", "#56B4E9", 
                               "#009E73", "#F0E442", "#0072B2", 
                               "#D55E00", "#CC79A7")
p1 <- data.frame(x=1:8, y=1:8, col = as.character(1:8)) %>% 
  ggplot(aes(x, y, color = col)) + geom_point(size=5)
p1 + scale_color_manual(values=color_blind_friendly_cols)

Scatter plots

Use scatter plots to examine the relationship between two continuous variables. In every single instance in which we have examined the relationship between two continuous variables, total murders versus population size, life expectancy versus fertility rates, and child mortality versus income, we have used scatter plots. This is the plot we generally recommend.

Slope charts

One exception where another type of plot may be more informative is when you are comparing variables of the same type but at different time points and for a relatively small number of comparisons. For example, comparing life expectancy between 2010 and 2015. In this case we might recommend a slope chart. There is not a geometry for slope chart in ggplot2 but we can construct one using geom_lines. We need to do some tinkering to add labels.

Here is a comparison for large western countries:

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")
dat <- gapminder %>% 
  filter(year%in% c(2010, 2015) & region %in% west & 
           !is.na(life_expectancy) & population > 10^7) 
dat %>%
  mutate(location = ifelse(year == 2010, 1, 2), 
         location = ifelse(year == 2015 & 
         country%in%c("United Kingdom","Portugal"), 
         location+0.22, location),
         hjust = ifelse(year == 2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group = country)) +
  geom_line(aes(color = country), show.legend = FALSE) +
  geom_text(aes(x = location, label = country, hjust = hjust), 
            show.legend = FALSE) +
  xlab("") + ylab("Life Expectancy")

An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Note that we are using angle as the visual cue. But we also have position to determine the exact values. Comparing the improvements is a bit harder with a scatter plot:

Use common charts

Note that in the scatter plot we have followed the principle use common axes since we are comparing these before and after. However, if we have many points the slope charts stop being useful as it becomes hard to see each line.

Bland-Altman plot

Since we are interested in the difference, it makes sense to dedicate one of our axes to it. The Bland-Altman plot, also known as the Tukey mean-difference plot and the MA-plot, shows the difference versus the average:

Here we quicky see which countries have improved the most as it is represented by the y-axis. We also get an idea of the overall value from the x-axis.

Encoding a third variable

We previously showed a scatter plot showing the relationship between infant survival and average income. Here is a version of this plot where we encode three variables: OPEC membership, region, and population:

Note that we encode categorical variables with color, hue, and shape. The shape can be controlled with the shape argument. Below are the shapes available for use in R. Note that for the last five, the color goes inside.

The default shape values are a circle and a triangle for OPEC membership. We can manually customize these by adding the layer scale_shape_manual(values = c(8, 10)), where 8 and 10 are the numbers of the desired shapes from the list above.

dat %>% 
  ggplot(aes(dollars_per_day, 1 - infant_mortality/1000, 
             col = region, size = population/10^6,
             shape =  OPEC)) +
  scale_x_continuous(trans = "log2", limits=c(0.25, 150)) +
  scale_y_continuous(trans = "logit",limit=c(0.875, .9981),
                     breaks=c(.85,.90,.95,.99,.995,.998)) + 
  geom_point(alpha = 0.5) +
  scale_shape_manual(values = c(8, 10))

For continuous variables we can use color, intensity or size. We now show an example of how we do this with a case study.

Case Study: Vaccines

Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, like smallpox and polio, were common. However, today, despite all the scientific evidence for their importance, vaccination programs have become somewhat controversial.

library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame':    16065 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...

We create a temporary object dat that stores only the Measles data, includes a per 100,000 rate, orders states by average value of disease and removes Alaska and Hawaii since they only became states in the late 1950s.

the_disease <- "Measles"
dat <- us_contagious_diseases %>%
       filter(!state %in% c("Hawaii","Alaska") & disease == the_disease) %>%
       mutate(rate = (count / weeks_reporting) * 52 / (population / 100000)) 

Can we show data for all states in one plot? We have three variables to show: year, state and rate.

When choosing colors to quantify a numeric variable we chose between two options: sequential and diverging. Sequential colors are suited for data that goes from high to low. High values are clearly distinguished from low values. Here are some examples offered by the package RColorBrewer:

library(RColorBrewer)
display.brewer.all(type = "seq")

Diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range: higher than the center and lower than the center. An example of when we would use a divergent pattern would be if we were to show height in standard deviations away from the average. Here are some examples of divergent patterns:

library(RColorBrewer)
display.brewer.all(type="div")

In our example we want to use a sequential palette since there is no meaningful center, just low and high rates.

geom tile

We use the geometry geom_tile to tile the region with colors representing disease rates. We use a square root transformation to avoid having the really high counts dominate the plot.

dat %>% ggplot(aes(year, state,  fill = rate)) +
        geom_tile(color = "grey50") +
        scale_x_continuous(expand = c(0,0)) +
        scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
        geom_vline(xintercept = 1963, col = "blue") +
        theme_minimal() +  theme(panel.grid = element_blank()) +
        ggtitle(the_disease) + 
        ylab("") + 
        xlab("")

This plot makes a very striking argument for the contribution of vaccines. However, one limitation of this plot is that it uses color to represent quantity which we earlier explained makes it a bit harder to know exactly how high it is going. Position and length are better cues. If we are willing to lose state information, we can make a version of the plot that shows the values with position.

Avoid pseudo 3D plots

The figure below, taken from the scientific literature1, shows three variables: dose, drug type and survival. Although your screen/book page is flat and two-dimensional, the plot tries to imitate three dimensions and assigned a dimension to each variable.

(Image courtesy of Karl Broman)

Humans are not good at seeing in three dimensions (which explains why it is hard to parallel park) and our limitation is even worse with regard to pseudo-three-dimensions. To see this, try to determine the values of the survival variable in the plot above. Can you tell when the purple ribbon intersects the red one? This is an example in which we can easily use color to represent the categorical variable instead of using a pseudo-3D:

Notice how much easier it is to determine the survival values.

Pseudo-3D is sometimes used completely gratuitously: plots are made to look 3D even when the 3rd dimension does not represent a quantity. This only adds confusion and makes it harder to relay your message. Here are two examples:

(Images courtesy of Karl Broman)

Avoid too many significant digits

By default, statistical software like R returns many significant digits. The default behavior in R is to show 7 significant digits. That many digits often adds no information and the added visual clutter can make it hard for the viewer to understand the message. As an example, here are the per 10,000 disease rates, computed from totals and population in R, for California across the five decades:

state year Measles Pertussis Polio
California 1940 37.8826320 18.3397861 0.8266512
California 1950 13.9124205 4.7467350 1.9742639
California 1960 14.1386471 NA 0.2640419
California 1970 0.9767889 NA NA
California 1980 0.3743467 0.0515466 NA

We are reporting precision up to 0.00001 cases per 10,000, a very small value in the context of the changes that are occurring across the dates. In this case, two significant figures is more than enough and clearly makes the point that rates are decreasing:

state year Measles Pertussis Polio
California 1940 37.9 18.3 0.8
California 1950 13.9 4.7 2.0
California 1960 14.1 NA 0.3
California 1970 1.0 NA NA
California 1980 0.4 0.1 NA

Useful ways to change the number of significant digits or to round numbers are signif and round. You can define the number of significant digits globally by setting options like this: options(digits = 3).

Another principle related to displaying tables is to place values being compared on columns rather than rows. Note that our table above is easier to read than this one:

state disease 1940 1950 1960 1970 1980
California Measles 37.9 13.9 14.1 1 0.4
California Pertussis 18.3 4.7 NA NA 0.1
California Polio 0.8 2.0 0.3 NA NA

Know your audience

Graphs can be used for 1) our own exploratory data analysis, 2) to convey a message to experts, or 3) to help tell a story to a general audience. Make sure that the intended audience understands each element of the plot.

As a simple example, consider that for your own exploration it may be more useful to log-transform data and then plot it. However, for a general audience that is unfamiliar with converting logged values back to the original measurements, using a log-scale for the axis instead of log-transformed values will be much easier to digest.


  1. https://projecteuclid.org/download/pdf_1/euclid.ss/1177010488↩︎